SSE2/AVX2 optimized get_checksum2()/MD5 for x86-64, and MD5P8 whole-f…#23

Closed
Chainfire wants to merge 1 commit into RsyncProject:master from Chainfire:CSUM2-BUILD

Conversation

@Chainfire
Contributor

…ile checksum

  • MD5 optimization in block matching phase:

MD5 hashes computed during rsync's block matching phase are independent
and thus possible to process in parallel. This code processes 4 blocks
in parallel if SSE2 is available, or 8 if AVX2 is available. An increase
of performance (or decrease of CPU usage) of up to 6x has been measured.

A prefetching algorithm is used to predict and load upcoming blocks, as
this prevents the need for extensive modifications to other parts of
the rsync sources to get this working.

This remains compatible with existing rsync builds using MD5 checksums.
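The reason independent block hashes vectorize well can be sketched in portable C. This is only an illustration of the lane structure, not the code from this patch: `toy_step`, `toy_hash`, `toy_hash_lanes`, and `LANES` are hypothetical names, the hash is 32-bit FNV-1a standing in for MD5, and no SSE2/AVX2 intrinsics or prefetching are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* 4 mirrors the SSE2 case; the AVX2 case would use 8 */

/* Stand-in hash step (32-bit FNV-1a), NOT MD5. */
static uint32_t toy_step(uint32_t state, uint8_t byte)
{
    return (state ^ byte) * 16777619u;
}

/* Reference: hash a single block on its own. */
static uint32_t toy_hash(const uint8_t *buf, size_t len)
{
    uint32_t state = 2166136261u;
    for (size_t i = 0; i < len; i++)
        state = toy_step(state, buf[i]);
    return state;
}

/* Hash LANES equal-length blocks in lockstep. The inner loop over
 * lanes has no cross-lane dependency, which is the property a SIMD
 * build exploits to update all lanes with one vector operation. */
static void toy_hash_lanes(const uint8_t *bufs[LANES], size_t len,
                           uint32_t out[LANES])
{
    assert(bufs != NULL && out != NULL);
    for (int l = 0; l < LANES; l++)
        out[l] = 2166136261u;
    for (size_t i = 0; i < len; i++)
        for (int l = 0; l < LANES; l++)
            out[l] = toy_step(out[l], bufs[l][i]);
}
```

Each lane produces the same value as hashing that block alone, which matches the compatibility claim above: parallelizing across blocks changes how the work is scheduled, not the resulting checksums.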

  • MD5P8 whole-file checksum:

Splits the input up into 8 independent streams (64-byte interleave), and
produces a final checksum based on the end state of those 8 streams. If
parallelization of MD5 hashing is available, the performance gain (or
CPU usage decrease) is 2x to 6x compared to traditional MD5.
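The 64-byte interleave can be pictured as dealing consecutive 64-byte chunks of the input round-robin to the 8 streams. The sketch below is a rough model under stated assumptions, not the MD5P8 implementation: `stream_of`, `toy_step`, and `toy_p8` are hypothetical names, FNV-1a stands in for MD5, and the final fold over the eight end states is a guess at the general shape, since the exact finalization is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STREAMS 8 /* eight independent streams, per the description */
#define CHUNK 64  /* 64-byte interleave (also MD5's block size) */

/* Which stream receives the input byte at this offset? */
static unsigned stream_of(size_t offset)
{
    return (unsigned)((offset / CHUNK) % STREAMS);
}

/* Stand-in per-stream hash step (32-bit FNV-1a), NOT MD5. */
static uint32_t toy_step(uint32_t state, uint8_t byte)
{
    return (state ^ byte) * 16777619u;
}

/* Feed the input to STREAMS stand-in hashes, round-robin by chunk,
 * then fold the eight end states into one value. The fold is an
 * assumption made for illustration only. */
static uint32_t toy_p8(const uint8_t *buf, size_t len)
{
    uint32_t state[STREAMS];
    assert(buf != NULL || len == 0);
    for (int s = 0; s < STREAMS; s++)
        state[s] = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        unsigned s = stream_of(i);
        state[s] = toy_step(state[s], buf[i]);
    }
    /* Hash the end states in stream order, low byte first. */
    uint32_t out = 2166136261u;
    for (int s = 0; s < STREAMS; s++)
        for (int k = 0; k < 4; k++)
            out = toy_step(out, (uint8_t)(state[s] >> (8 * k)));
    return out;
}
```

Because the eight streams never read each other's state, they can occupy the eight lanes of an AVX2 register. The result differs from a plain single-stream hash of the same input, which is why a scheme like this has to be treated as a distinct checksum type.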

The rsync versions on both ends of the connection need MD5P8 support
built in for it to be used.

xxHash is still preferred (and faster), but this provides a reasonably
fast fallback for the case where xxHash libraries are not available at
build time.

@Chainfire
Contributor Author

Chainfire commented Jun 19, 2020

Update to #6. Based on your diff of my original submission, brought up to date with the latest master source, tested on the same compiler versions and distros as #20, and added some benchmarks as well.

Probably still for patches rather than master, but it's better than it was.

@WayneD
Member

WayneD commented Jun 19, 2020

Thanks! I've turned this into the latest md5p8.diff in the rsync-patches repo (which is now also on GitHub).

@WayneD WayneD closed this Jun 19, 2020
@WayneD WayneD self-assigned this Jul 4, 2020
